This project aims at identifying predictors of health outcomes using socioeconomic factors. Specifically, income, housing costs, food security, and food costs will be tested for association with surveyed outocmes from Behavioral Risk Factor Surveillance System (BRFSS) epidemiologic data. BRFSS data is tracked at the metropolitan level, and SMART BRFSS 2017 data was downloaded as XPT file here Measures of mental and physical health will be analyzed using secondary data from the American Community Survey (ACS), Housing Urban Development (HUD) fair market rent (fmr) data, Feeding America dataset. The main faculty for this project is Blanca Himes who gave project advice and provided a reference paper (Leszinsky et al.) and Sherry Xie, who assisted with geospatial plotting and formatting.
Describe the problem addressed, its significance, and some background to motivate the problem.
Wealth inequality has been increasing in the United States for several decades. One estimate by Congressional Budget Office (CBO) estimates a 20% increase in income inequality between 1980-2016. This is seen as increased wealth accumulation for the high-earning Americans (top 1%) who have seen their share of aggregate income increase by nearly 20% as middle-class americans saw a decreased share. Despite a steady increase in the Gross Domestic Product (GDP), real median income has not risen at a similar rate. This suggests that increased productivity in the United States has not led to increased earnings for most Americans over the past four decades. In addition to wage stagnation, the cost of living in the Country has risen as well. For most people, the largest share of expenses is taken by housing costs, accounting for 30-40% of an average American’s monthly budget. According to the Center on Budget and Policy Priorities (CBPP), median rents across the nation have risen by 13% between 2001-2008, far outpacing median household income growth. Thus, on average, a larger proportion of American incomes are being spent on necessities. Associations between income and health have been intensly studied across many economic, social policy, and epidmiologic fields. One study looked at National Health and Nutrition Examination Survey (NHANES) data which is taken from a survey conducted by the CDC. The authors investigated how survey participants responsed to questions about their general health. The study finds that those in households with high income self-reported “good heath” more frequently than those in middle to low income household. The authors also found that increased income correlates to decreased “stress load” as measured as a weighted average of 5 biomarkers associated with stress such as blood pressure and creatine levels. Overall, it is clear that an individual’s economic status strongly correlates to health outcomes. High income individuals have lower stress and better general health. Chronic stressors such as food insecurity, high costs-of-living and lack of access to preventivive/primary care all contribute to health disparities by income. Based on these findings, I wanted to combine individual demographics (including income) with aggregate data to better predict several health outcomes from the BRFSS. I wanted to see how median metropolitan statitics of poverty and income predict outcomes compared to invidiual income statistics and if combining the two increases predicitive power. Additionally, I wanted to include median rent data and food expense data, as these are two of the largest expenses in any given household. Thus, I would expect that health outcomes might vary depending on both income and cost-of-living as a result of increased financial insecurity.
This project is interdisciplinary because combines economic data with health survey data. I wanted to see if BRFSS data can be combined with metro-level cost-of-living factors to better predict health outcomes compared to individual demographics alone. I will use data from multiple sources including the CDC, BLS, HUD, ACS, and Feeding America to accomplish these goals.
Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.
For exploring growth rate, I looked at change in the consumer price index (CPI) between 2018-2020 for major metropolitan areas in the United States. This data was taken from the Bureau of Labor Statistics (BLS) To look at county-level income and poverty data, I query 5-year ACS data from 2014-2018 for poverty rate and median income. I also collected HUD fmr rate for the past from 2017-2020. Finally I use the “Map the Gap” calculated county-level meal cost from the non-profit Feeding America to estimate food expense. This cost-of-food index is calculated by applying county-level sales tax rates to Nielsen basket-of-goods based on actual pounds purchased. This “meal cost” value was last calculated in 2010 by Feeding America. Necessary corrections to nominal housing costs for inflation were corrected using CPI values from 2010-2020. BRFSS data is ta
In the first section, I explore economic factors across the country either at a metroplotian level or by county level. Using geospatioal maps, explore income, rent, and percent of rent spent on income across the country. I also looked at how the inflation rate (marker for economic growth) varies by major US cities. I also explore county-level poverty and income rates. Finally I make some plots checking the assocition between poverty and income as well as poverty and costs-of-living (rent, food expense).
Finally, I load the 2017 BRFSS dataset and clean the dataframe. I check how aggregate factors (by Metro Area) affect health outcomes (general health, COPD history, and history of depression). I also look at how individual income alone affects these same outcomes. For these analyses I use a combination of plots and Chi-squared tests to check for independence. I create a logistic regression model combining BRFSS demographic factors (age, sex, race, income) to predict these outcomes. Finally, I add aggregate “cost-of-living” data to this logistic regression model to see if predictive power improves.
Download and install Rtools40 to use the BLS api: https://cran.r-project.org/bin/windows/Rtools/
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(gtsummary)
# tools for accessing BLS data
# install_github("mikeasilva/blsAPI", force = TRUE) # Install the blsAPI by Mike Silva
library(devtools)## Loading required package: usethis
# query data from BLS.gov
# load api key (must have the txt file)
api_key <- read.csv("bls_api_key.txt", header=FALSE)[[1]]
# signatures to be queried from BLS.gov, 10-year CPI data from major metropolitan areas
signatures <- read.csv("metro_area_cpi_signatures.csv")
# payload signatures to query BLS
month_payload <- list('seriesid'= signatures$signature_monthly) #2018-2020 monthly CPI
annual_payload <- list('seriesid'= signatures$signature_annual) #2018-2020 annual CPI
# query CPI 2018-2020 data from major metropolitan areas from BLS
months_cpi <- blsAPI(month_payload, return_data_frame = TRUE) # monthly CPI dataframe
annual_cpi <- blsAPI(annual_payload, return_data_frame = TRUE) # annual CPI dataframe# add columns to join region labels by
months_cpi$signature_monthly <- months_cpi$seriesID
annual_cpi$signature_annual <- annual_cpi$seriesID
# join region and state labels to the cpi data
# complete monthly dataframe for 2018-2020 regional metro cpi data
months_cpi %<>%
left_join(signatures[, c("signature_monthly","state","area")],by="signature_monthly") %>%
dplyr::select(-seriesID)
# complete annual dataframe for 2018-2020 regional metro cpi data
annual_cpi %<>%
left_join(signatures[, c("signature_annual","state","area")],by="signature_annual") %>%
dplyr::select(-seriesID)
head(months_cpi)## year period periodName value signature_monthly state area
## 1 2020 M10 October 260.892 CUURS35ESA0 MD Baltimore-Columbia-Towson
## 2 2020 M08 August 259.336 CUURS35ESA0 MD Baltimore-Columbia-Towson
## 3 2020 M06 June 257.942 CUURS35ESA0 MD Baltimore-Columbia-Towson
## 4 2020 M04 April 258.978 CUURS35ESA0 MD Baltimore-Columbia-Towson
## 5 2020 M02 February 259.121 CUURS35ESA0 MD Baltimore-Columbia-Towson
## 6 2019 M12 December 257.847 CUURS35ESA0 MD Baltimore-Columbia-Towson
## year period periodName value signature_annual state area
## 1 2020 S01 1st Half 258.891 CUUSS35ESA0 MD Baltimore-Columbia-Towson
## 2 2019 S02 2nd Half 257.288 CUUSS35ESA0 MD Baltimore-Columbia-Towson
## 3 2019 S01 1st Half 256.485 CUUSS35ESA0 MD Baltimore-Columbia-Towson
## 4 2018 S02 2nd Half 254.382 CUUSS35ESA0 MD Baltimore-Columbia-Towson
## 5 2018 S01 1st Half 252.401 CUUSS35ESA0 MD Baltimore-Columbia-Towson
## 6 2020 S01 1st Half 244.889 CUUSS35CSA0 GA Atlanta-Sandy Springs-Roswell
# Further format dataframes
# pivot data by metro area and years, add column for mean annual CPI
# annual cpi
a_cpi <- annual_cpi %>%
pivot_wider(id_cols = c("area","year"),names_from = "periodName",
values_from = "value")%>% # pivot on area and year using values from period
mutate_at(vars(3:4), as.numeric) %>% # change CPI values to numeric
mutate(annual = rowMeans(select(., c("1st Half", "2nd Half")), na.rm= TRUE))# mean annual cpi
# monthly cpi
m_cpi <- months_cpi %>%
dplyr::select(-period) %>%
pivot_wider(id_cols = c("area","year"), names_from = "periodName",
values_from = "value") %>% # pivot on area and year using values from period
mutate_at(vars(2:14), as.numeric) %>% # change CPI values to numeric
mutate(annual = rowMeans(select(.,-c("area", "year")), na.rm = TRUE)) # mean annual cpi
head(a_cpi)## # A tibble: 6 x 5
## area year `1st Half` `2nd Half` annual
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Baltimore-Columbia-Towson 2020 259. NA 259.
## 2 Baltimore-Columbia-Towson 2019 256. 257. 257.
## 3 Baltimore-Columbia-Towson 2018 252. 254. 253.
## 4 Atlanta-Sandy Springs-Roswell 2020 245. NA 245.
## 5 Atlanta-Sandy Springs-Roswell 2019 242. 246. 244.
## 6 Atlanta-Sandy Springs-Roswell 2018 238. 239. 239.
## # A tibble: 6 x 15
## area year October August June April February December September July May March January November annual
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Baltimore-Columbia-Towson 2020 261. 259. 258. 259. 259. NA NA NA NA NA NA NA 259.
## 2 Baltimore-Columbia-Towson 2019 258. 257. 257. 259. 254. 258. NA NA NA NA NA NA 257.
## 3 Baltimore-Columbia-Towson 2018 255. 255. 254. 252. 252. 253. NA NA NA NA NA NA 254.
## 4 Atlanta-Sandy Springs-Roswell 2020 249. 248. 245. 243. 247. NA NA NA NA NA NA NA 246.
## 5 Atlanta-Sandy Springs-Roswell 2019 246. 246. 243. 243. 240. 245. NA NA NA NA NA NA 244.
## 6 Atlanta-Sandy Springs-Roswell 2018 239. 241. 240. 237. 237 237. NA NA NA NA NA NA 239.
a_cpi_growth_rate <- as.data.frame(a_cpi) %>%
dplyr::select(area, annual) %>%
dplyr::group_by(area) %>%
dplyr::summarise(growth = (max(annual)-min(annual))/min(annual)*100) #growth from 2018-2020## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(a_cpi_growth_rate, aes(x = reorder(area, -growth), y = growth))+
geom_col() +
theme(axis.text.x = element_text(angle = 90)) +
theme(axis.title.x = element_blank())+
geom_hline(yintercept=2.2, linetype="dashed", color = "red")+
ylab("Growth (% CPI Change 2018-2020)")## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## Loading required package: grid
library(stringr)
library(grid)
# function to format the geo id column of FMR dataframes
format_id <- function(df){
df$ID <- str_replace(as.character(df$fips2010),
as.character(df$CouSub),"")
df$ID <- as.numeric(df$ID) + 50000000000
df$ID <- paste0("0",as.character(df$ID))
return(df)
}
# read in the housing costs: fair market rates (2017-2020)
# format relevent columns into a standard form
fmr_2017 <- read.csv("FY2017_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
dplyr::mutate_at(c("fmr0", "fmr1", "fmr2", "fmr3", "fmr4"), as.numeric)
fmr_2018 <- read.csv("FY2018_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
dplyr::mutate_at(c("fmr_0", "fmr_1", "fmr_2", "fmr_3", "fmr_4"), as.numeric) %>%
dplyr::rename("CouSub" = "cousub", "fmr0" = "fmr_0","fmr1" = "fmr_1","fmr2" = "fmr_2",
"fmr3" = "fmr_3", "fmr4" = "fmr_4", "Metro_code" = "metro_code")
fmr_2019 <- read.csv("FY2019_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
dplyr::mutate_at(c("fmr_0", "fmr_1", "fmr_2", "fmr_3", "fmr_4"), as.numeric) %>%
dplyr::rename("CouSub" = "cousub","fmr0" = "fmr_0","fmr1" = "fmr_1","fmr2" = "fmr_2",
"fmr3" = "fmr_3", "fmr4" = "fmr_4", "Metro_code" = "metro_code")
fmr_2020 <- read.csv("FY2020_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
dplyr::mutate_at(c("fmr_0", "fmr_1", "fmr_2", "fmr_3", "fmr_4"), as.numeric) %>%
dplyr::rename("CouSub" = "cousub", "fmr0" = "fmr_0","fmr1" = "fmr_1","fmr2" = "fmr_2",
"fmr3" = "fmr_3", "fmr4" = "fmr_4", "Metro_code" = "metro_code")
fmr_2021 <- read.csv("FY2021_50_County_FMR.csv", fileEncoding = "UTF-8-BOM") %>%
dplyr::mutate_at(c("rent50_0", "rent50_1", "rent50_2", "rent50_3", "rent50_4"),
as.numeric) %>%
dplyr::rename("CouSub" = "cousub", "fmr0" = "rent50_0",
"fmr1" = "rent50_1","fmr2" = "rent50_2",
"fmr3" = "rent50_3", "fmr4" = "rent50_4",
"Metro_code" = "cbsasub21", "areaname" = "areaname21",
"countyname" = "cntyname")
# read in shape file of US counties
counties <- readRDS(gzcon(url('https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/uscounties_2010.rds'))) # decompress rds file and read in shape file
counties$GEO_ID <- as.character(counties$GEO_ID)#change geo id to characters in counties df
counties$ID <- gsub("US","", counties$GEO_ID) #remove US from ID
# read in metro area shape file and convert into spatial components
setwd("tl_2017_us_cbsa")
cbsa.sf <- st_read("tl_2017_us_cbsa.shp")## Reading layer `tl_2017_us_cbsa' from data source `C:\Users\chana\Box Sync\Graduate School\Courses\Fall 2020\BMIN 503\Final Project\BMIN503_Final_Project\tl_2017_us_cbsa\tl_2017_us_cbsa.shp' using driver `ESRI Shapefile'
## Simple feature collection with 945 features and 12 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -178.4436 ymin: 17.83151 xmax: -65.42271 ymax: 65.45352
## geographic CRS: NAD83
cbsa.sf %<>%
dplyr::select(GEOID, geometry,NAME, NAMELSAD) %>%
dplyr::rename(ID2 = GEOID, Metro = NAMELSAD)
# function to merge FMR dataframes with county geometries
geo_format <- function(geo_df){
fmr_geo <- counties %>%
dplyr::select(c("NAME", "ID", "geometry")) %>%
left_join(geo_df[,c("ID","fmr0", "fmr1", "fmr2", "fmr3", "fmr4",
"Metro_code", "areaname", "countyname")], by="ID") %>%
distinct(Metro_code, .keep_all = TRUE)
}
# function to merge FMR dataframes with metro area geometries
geo_format2 <- function(geo_df){
geo_df %<>%
dplyr::select(ID,fmr0,fmr1,fmr2, fmr3, fmr4, Metro_code, areaname, countyname)%>%
distinct(Metro_code, .keep_all = TRUE) %>% # removed repeated counties
dplyr::mutate(ID2 = substr(Metro_code, 12,16)) # metro IDs
fmr_geo <- cbsa.sf %>%
dplyr::select(c("NAME", "ID2", "geometry", "Metro")) %>%
left_join(geo_df, by="ID2") # join by metro area
}
# format geographical IDs for all FMRs 2017-20201
fmr_2017 <- format_id(fmr_2017)
fmr_2018 <- format_id(fmr_2018)
fmr_2019 <- format_id(fmr_2019)
fmr_2020 <- format_id(fmr_2020)
fmr_2021 <- format_id(fmr_2021)
# add county geographical data to all FMRs 2017-20201
fmr_2017_geo <- geo_format(fmr_2017)
fmr_2018_geo <- geo_format(fmr_2018)
fmr_2019_geo <- geo_format(fmr_2019)
fmr_2020_geo <- geo_format(fmr_2020)
fmr_2021_geo <- geo_format(fmr_2021)
# add metro area geographical data to all FMRs 2017-20201
fmr_2017_geo2 <- geo_format2(fmr_2017)
fmr_2018_geo2 <- geo_format2(fmr_2018)
fmr_2019_geo2 <- geo_format2(fmr_2019)
fmr_2020_geo2 <- geo_format2(fmr_2020)
fmr_2021_geo2 <- geo_format2(fmr_2021)library(RColorBrewer)
library(leaflet)
my_theme <- function() {
theme_minimal() + # shorthand for white background color
theme(axis.line = element_blank(), # further customization of theme components
axis.text = element_blank(), # remove x and y axis text and labels
axis.title = element_blank(),
panel.grid = element_line(color = "white"), # make grid lines invisible
legend.key.size = unit(0.8, "cm"), # increase size of legend
legend.text = element_text(size = 16), # increase legend text size
legend.title = element_text(size = 16),
plot.title = element_text(hjust = 0.5)) # increase legend title size
}
myPalette <- colorRampPalette(brewer.pal(9, "BuPu")) # Blue-Purple
myPalette2 <- colorRampPalette(brewer.pal(9, "YlOrRd")) # Yellow-Orange-Red
myPalette3 <- colorRampPalette(brewer.pal(9, "YlGnBu")) # Yellow-Green-Blue
# geographic visualizations of housing costs
cpi_by_year <- read.csv("annual_cpi_us.csv", fileEncoding = "UTF-8-BOM")
cpi_by_year$adjust <- cpi_by_year$cpi/258.2937 # adjustment factor into 2020 dollars
# use the cpi_by_year adjustment to adjust rents to 2010 dollars
# rent in y year = (rent in current year) X (CPI y year/CPI current year)
# Select a color palette with which to run the palette function
pal_fun <- colorNumeric("BuPu", NULL) # Blue-Purple from RColorBrewer
pal_fun2 <- colorNumeric("YlOrRd", NULL) # Yellow-Orange-Red from RColorBrewer
# function to get inflation correction factor for a given year into 2020 dollars
i_factor <- function(year){
c_factor <- cpi_by_year$adjust[which(cpi_by_year$year == year)]
return(c_factor)
}
# function to generate static map of rent costs by year across all coutnies in US
map_rent <- function(df, year){
correction = i_factor(year)
fmr_map = ggplot() +
geom_sf(data=df,lwd=0,
aes(fill = fmr1/correction)) + # adjust factor included
ggtitle(paste(year, "Median Rent for 1-bed")) +
scale_fill_gradientn(name = "Rent (2020 $)",
colours = myPalette2(100),
limits = c(0,2000),
breaks = c(500,1000,1500,2000)) + # add limits and breaks to normalize scaling between years
north(data = df, symbol = 12, location = "bottomright") +
scalebar(data = df, dist = 500, dist_unit = 'mi', transform = TRUE,
location = "bottomleft",
st.dist = 0.1) +
my_theme()
}
# call static map function and display
fmr1_2017_map <- map_rent(fmr_2017_geo, 2017) # generate static rent map for counties
fmr1_2017_map # display rent map# call static map function and display
fmr1_2017_map2 <- map_rent(fmr_2017_geo2, 2017) # generate static rent map for metro areas
fmr1_2017_map2 # display rent map# an interactive map of rent prices
library(htmlwidgets)
library(htmltools)
pal_fun <- colorNumeric("YlOrRd", NULL) # Yellow-Orange-Red from RColorBrewer
# ------------------------------------------------- #
# HTML code to fix misaligned "na" in map legend
# ------------------------------------------------- #
library(htmlwidgets)
css_fix <- "div.info.legend.leaflet-control br {clear: both;}" # CSS to correct spacing
html_fix <- htmltools::tags$style(type = "text/css", css_fix) # Convert CSS to HTML
# Funciton allows to plot interactive leaflet maps of median US rents by year
map_rent_i <- function(df, year){
#---------------------------------------------------#
# HTML code to add title to interactive maps
#---------------------------------------------------#
tag.map.title <- tags$style(HTML("
.leaflet-control.map-title {
transform: translate(-50%,20%);
position: fixed !important;
left: 50%;
text-align: center;
padding-left: 10px;
padding-right: 10px;
background: rgba(255,255,255,0.75);
font-weight: bold;
color: #000000;
font-size: 20px;
}
"))
title <- tags$div(
tag.map.title, HTML(paste(year, "Median Rent:", 1, "Bed"))
)
correction = i_factor(year)# find correction factor
# Pop-up message
pu_message <- paste0("County: ", df$countyname,
"<br>Rent: ", "$", floor(df$fmr1/correction)) # in 2010 dollars
# Adding more customization
interactive_map <- leaflet(df) %>%
addPolygons(stroke = FALSE, # remove polygon borders
fillColor = ~pal_fun(fmr1/correction), # in 2010 dollars
fillOpacity = 0.5, smoothFactor = 0.5, # increase opacity and resolution
popup = ~pu_message) %>%
addProviderTiles(providers$CartoDB.Positron) %>% # add third party provider tile
#addProviderTiles(providers$Stamen.Toner) %>%
#addProviderTiles(providers$Esri.NatGeoWorldMap)
addLegend("bottomright", # location of legend
pal=pal_fun, # palette function
values=~fmr1/correction, # variable to be passed to palette function
title = paste("Rent",1,"bed (2020 $)"), # legend title
opacity = 1) %>% # legend opacity (1 = completely opaque)
addScaleBar() %>%
addTiles() %>%
addControl(title, position = "topleft", className="map-title") #add title panel
return(interactive_map %<>% htmlwidgets::prependContent(html_fix))
}
# Call interactive map function and display
fmr_2020_mapi <- map_rent_i(fmr_2020_geo, 2020)
fmr_2020_mapilibrary(tidycensus)
library(tidyr)
census_key <- read.delim("census_key.txt", header=FALSE)[1,1] # read in api key from local
# census_api_key(census_key,install=TRUE)acs.data <- get_acs(geography = "county", # query data at the county level
year = 2018, # end year (these will give us ACS 5-year estimates for 2014-2018)
variables = c("B17010_002", # number of families falling below the poverty threshold
"B17010_001", #total number of families for which poverty was determined
"B19013_001"), # median household income
key = census_key) ## Getting data from the 2014-2018 5-year ACS
acs.data %<>%
dplyr::select(-moe) #remove the margin of error
# pivot on both estimate and total variables
acs.data <- pivot_wider(data=acs.data, names_from="variable", values_from="estimate")
acs.data %<>%
dplyr::rename(poverty_total = B17010_001,
poverty_estimate = B17010_002,
med_house_income = B19013_001) %>% # give vars meaningful names
dplyr::mutate(ID = paste0("0", as.character(050000000000 + as.numeric(GEOID)))) #create ID column to merge on# geospatial exploration of ACS median income from 2014-2018 by county
# join acs data
acs.data.geo <- left_join(counties, acs.data[,c("ID", "poverty_total",
"poverty_estimate", "med_house_income")],
by = "ID") # merge to counties
acs.data.geo %<>% # clean up the dataframe
dplyr::select(GEO_ID, ID, COUNTY, poverty_total, poverty_estimate,
med_house_income,geometry)
# plot income map
income_map <- ggplot() +
geom_sf(data=acs.data.geo,lwd=0,
aes(fill = med_house_income)) + # adjust factor included
ggtitle("Median Household Income (2014-2018)") +
scale_fill_gradientn(name = "Income",
colours = myPalette2(100)) + # add limits and breaks to normalize scaling between years
north(data = acs.data.geo, location = "bottomright", symbol = 12) +
scalebar(data = acs.data.geo, dist = 500, dist_unit = 'mi', transform = TRUE,
location = "bottomleft",
st.dist = 0.1) +
my_theme()
income_map # show the median income maplibrary(ggpubr)
options(scipen=10000)
# joins acs data and fmr data for any given fmr year
acs_join <- function(acs.geo.df, fmr.geo.df){
acs.df <- dplyr::select(as.data.frame(acs.data.geo), -geometry)
acs_fmr <- left_join(fmr.geo.df, acs.df, by = "ID")
return(acs_fmr)
}
# combine acs data with fmr data
acs.fmr.2017 <- acs_join(acs.data.geo, fmr_2017_geo)
acs.fmr.2018 <- acs_join(acs.data.geo, fmr_2018_geo)
acs.fmr.2019 <- acs_join(acs.data.geo, fmr_2019_geo)
acs.fmr.2020 <- acs_join(acs.data.geo, fmr_2020_geo)
acs.fmr.2021 <- acs_join(acs.data.geo, fmr_2021_geo) # calculate poverty rate and percent spent on housing by county for one year
thresh <- 0 # threshold for filtering poverty estimates
acs.fmr.2017 %<>%
dplyr::filter(poverty_estimate > thresh) %>%
dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
acs.fmr.2018 %<>%
dplyr::filter(poverty_estimate > thresh) %>%
dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
acs.fmr.2019 %<>%
dplyr::filter(poverty_estimate > thresh) %>%
dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
acs.fmr.2020 %<>%
dplyr::filter(poverty_estimate > thresh) %>%
dplyr::mutate(poverty_percent = poverty_estimate/poverty_total) %>%
dplyr::mutate(housing_percent = (fmr1*12)/med_house_income)
# plot map of percentage spent on median 1-bed apartment by median income
rent_income_map <- ggplot() +
geom_sf(data=acs.fmr.2017,lwd=0,
aes(fill = housing_percent)) + # adjust factor included
ggtitle("Percent of income spent on rent (2017 1-Bed)") +
scale_fill_gradientn(name = "% spent on rent",
colours = myPalette2(10)) + # add limits and breaks to normalize scaling between years
north(data = acs.fmr.2017, location = "bottomright", symbol = 12) +
scalebar(data = acs.fmr.2017, dist = 500, dist_unit = 'mi', transform = TRUE,
location = "bottomleft",
st.dist = 0.1) +
my_theme()
rent_income_map # show the median income map# relationship between amount spent on housing and poverty rate
# FMR 1-bed data from 2017-2020
# Median income estimated from 2014-2018 ACS data
# generate linear regression fits for the relationship between housing burdern and poverty
# looking at income alone across US counties
income_2017 <- ggplot(data = acs.fmr.2017, aes(x=med_house_income, y=poverty_percent))+
geom_point() +
xlab("Median Income ($)") +
ylab("Poverty Rate %") +
ggtitle("2017 US Counties ") +
theme(plot.title = element_text(hjust = 0.5))
income_2018 <- ggplot(data = acs.fmr.2018, aes(x=med_house_income, y=poverty_percent))+
geom_point() +
xlab("Median Income ($)") +
ylab("Poverty Rate %") +
ggtitle("2018 US Counties ") +
theme(plot.title = element_text(hjust = 0.5))
income_2019 <- ggplot(data = acs.fmr.2019, aes(x=med_house_income, y=poverty_percent))+
geom_point() +
xlab("Median Income ($)") +
ylab("Poverty Rate %") +
ggtitle("2019 US Counties ") +
theme(plot.title = element_text(hjust = 0.5))
income_2020 <- ggplot(data = acs.fmr.2020, aes(x=med_house_income, y=poverty_percent))+
geom_point() +
xlab("Median Income ($)") +
ylab("Poverty Rate %") +
ggtitle("2020 US Counties ") +
theme(plot.title = element_text(hjust = 0.5))
combined_income <- cowplot::plot_grid(income_2017,income_2018,
income_2019,income_2020)
combined_income# looking at housing cost alone across US counties
housing_2017 <- ggplot(data = acs.fmr.2017, aes(x=fmr1, y=poverty_percent))+
geom_point() +
xlab("Median Rent (1-Bed)") +
ylab("Poverty Rate %") +
ggtitle("2017 US Counties ") +
theme(plot.title = element_text(hjust = 0.5))
housing_2018 <- ggplot(data = acs.fmr.2018, aes(x=fmr1, y=poverty_percent))+
geom_point() +
xlab("Median Rent (1-Bed)") +
ylab("Poverty Rate %") +
ggtitle("2018 US Counties ") +
theme(plot.title = element_text(hjust = 0.5))
housing_2019 <- ggplot(data = acs.fmr.2019, aes(x=fmr1, y=poverty_percent))+
geom_point() +
xlab("Median Rent (1-Bed)") +
ylab("Poverty Rate %") +
ggtitle("2019 US Counties ") +
theme(plot.title = element_text(hjust = 0.5))
housing_2020 <- ggplot(data = acs.fmr.2020, aes(x=fmr1, y=poverty_percent))+
geom_point() +
xlab("Median Rent (1-Bed)") +
ylab("Poverty Rate %") +
ggtitle("2020 US Counties ") +
theme(plot.title = element_text(hjust = 0.5))
combined_housing <- cowplot::plot_grid(housing_2017, housing_2018,
housing_2019, housing_2020)
combined_housing# looking at income and housing cost combined across US counties
h_burden_2017 <- ggplot(data = acs.fmr.2017, aes(x=housing_percent, y=poverty_percent))+
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
xlab("Percent Spent on Rent") +
ylab("Poverty Rate %") +
ggtitle("2017 US Counties ") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_regline_equation()
h_burden_2018 <- ggplot(data = acs.fmr.2018, aes(x=housing_percent, y=poverty_percent))+
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
xlab("Percent Spent on Rent") +
ylab("Poverty Rate %") +
ggtitle("2018 US Counties ") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_regline_equation()
h_burden_2019 <-ggplot(data = acs.fmr.2019, aes(x=housing_percent, y=poverty_percent))+
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
xlab("Percent Spent on Rent") +
ylab("Poverty Rate %") +
ggtitle("2019 US Counties ") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_regline_equation()
h_burden_2020 <-ggplot(data = acs.fmr.2020, aes(x=housing_percent, y=poverty_percent))+
geom_point() +
geom_smooth(method = "lm", formula = y~x) +
xlab("Percent Spent on Rent") +
ylab("Poverty Rate %") +
ggtitle("2020 US Counties ") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_regline_equation()
combined_h_burden <- cowplot::plot_grid(h_burden_2017, h_burden_2018,
h_burden_2019, h_burden_2020)
combined_h_burden # display the combined graphsetwd("Map the Meal Gap data")
# map the gap county data 2010
food <- read.csv("MMG2012_2010Data_ToShare.csv", fileEncoding = "UTF-8-BOM")
food %<>%
dplyr::select(FIPS, State,County..State,X2010.Food.Insecurity.Rate,X2010.Cost.Per.Meal) %>%
dplyr::rename(County = County..State, food_insecure = X2010.Food.Insecurity.Rate,
meal_cost = X2010.Cost.Per.Meal) %>%
dplyr::mutate(ID = paste0("0", as.character(050000000000 + as.numeric(FIPS))))
food$meal_cost = as.numeric(gsub("[$]","",food$meal_cost))## Warning: NAs introduced by coercion
# generate a two-factor cost of living index by adding food and housing costs
acs.fmr.food.2017 %<>%
mutate(COL = meal_percent + housing_percent)
acs.fmr.food.2018 %<>%
mutate(COL = meal_percent + housing_percent)
acs.fmr.food.2019 %<>%
mutate(COL = meal_percent + housing_percent)
acs.fmr.food.2020 %<>%
mutate(COL = meal_percent + housing_percent)
COL_2017 <- ggplot(acs.fmr.food.2017, aes(x=COL, y=poverty_percent))+
geom_point()+
geom_smooth(method = "lm", formula = y~x) +
xlab("Combined food and housing costs") +
ylab("Poverty Rate %") +
ggtitle("2017 US Counties ") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_regline_equation()
COL_2018 <- ggplot(acs.fmr.food.2018, aes(x=COL, y=poverty_percent))+
geom_point()+
geom_smooth(method = "lm", formula = y~x) +
xlab("Combined food and housing costs") +
ylab("Poverty Rate %") +
ggtitle("2018 US Counties ") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_regline_equation()
COL_2019 <- ggplot(acs.fmr.food.2019, aes(x=COL, y=poverty_percent))+
geom_point()+
geom_smooth(method = "lm", formula = y~x) +
xlab("Combined food and housing costs") +
ylab("Poverty Rate %") +
ggtitle("2019 US Counties ") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_regline_equation()
COL_2020 <- ggplot(acs.fmr.food.2020, aes(x=COL, y=poverty_percent))+
geom_point()+
geom_smooth(method = "lm", formula = y~x) +
xlab("Combined food and housing costs") +
ylab("Poverty Rate %") +
ggtitle("2020 US Counties ") +
theme(plot.title = element_text(hjust = 0.5)) +
stat_regline_equation()
combined_COL <- cowplot::plot_grid(COL_2017,COL_2018,COL_2019,COL_2020)
combined_COL # display the combined graphNext, I make visualization to see the contribution of such costs alone verses median income. Based on these analyses, poverty rate and income are directly correlated as expected. However, median housing cost and food expenses alone do not seem to be directly associated with the poverty rate. Furthermore, calculated variables such as percent-spent-on-housing and percent-spent-on-food do not seem to give better correlations to poverty rate compared to median income alone, From these results I will use housing costs, food costs, and either poverty rate or median income from aggreate county data to predict health outcomes from BRFSS data.
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:ggpubr':
##
## mutate
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
pov_quant_2017 <- acs.fmr.food.2017 %>%
dplyr::mutate(pov_quant = quantcut(poverty_percent, q=4,by=0.25, na.rm=TRUE))
pov_meal_quart <- ggplot(data=pov_quant_2017, aes(x=med_house_income,
y=meal_cost,
color = pov_quant))+
geom_point() +
xlab("Median Household Income ($)") +
ylab("Meal Cost ($)") +
ggtitle("US Counties 2017") +
theme(plot.title = element_text(hjust = 0.5))
pov_meal_quartpov_house_quart <- ggplot(data=pov_quant_2017, aes(x=med_house_income,
y=fmr1,
color = pov_quant))+
geom_point() +
xlab("Median Household Income ($)") +
ylab("Median Rent ($)") +
ggtitle("US Counties 2017") +
theme(plot.title = element_text(hjust = 0.5))
pov_house_quart# select variables of interest
# clean BRFSS 2017 dataframe
brfss2017 <- brfss2017_raw %>%
dplyr::select('MMSANAME', 'X_MMSA', 'X_RACE', 'SEX', 'X_AGEG5YR','EDUCA','INCOME2',
'ADDEPEV2', 'CHCCOPD1', 'X_RFHLTH') %>%
dplyr::rename('Metro Area'='MMSANAME','Metro_ID'='X_MMSA','age'='X_AGEG5YR','race'='X_RACE',
'sex'='SEX', 'education'='EDUCA','income'='INCOME2',
'depression'='ADDEPEV2','copd'='CHCCOPD1', 'health'='X_RFHLTH') %>%
dplyr::filter(race %in% c(1,2,3,5,8)) %>%
dplyr::filter(sex %in% c(1,2)) %>%
dplyr::filter(age %in% seq(13)) %>%
dplyr::filter(education %in% c(1,2,3,4,5,6)) %>%
dplyr::filter(income %in% c(1,2,3,4,5,6,7,8)) %>%
dplyr::filter(depression %in% c(1,2)) %>%
dplyr::filter(copd %in% c(1,2)) %>%
dplyr::filter(health %in% c(1,2))
brfss2017 %<>%
dplyr::mutate('Metro_ID' = as.character(brfss2017$'Metro_ID')) %>%
dplyr::mutate(race = factor(race, levels = c(1,2,3,5,8),
labels = c( "White", "Black", "American Indian",
"Asian", "Hispanic"))) %>%
dplyr::mutate(sex = factor(sex, levels = c(1,2),
labels = c("Male", "Female"))) %>%
dplyr::mutate(age = factor(age, levels = seq(13),
labels = c("18-24","25-29","30-34","35-39","40-44","45-49","50-54",
"55-59","60-64","65-69","70-74","75-29",">79"))) %>%
dplyr::mutate(education = factor(education, levels = c(1,2,3,4,5,6),
labels = c("None","Elementary", "Some HS", "HS",
"Some College", "College"))) %>%
dplyr::mutate(income = factor(income, levels = c(1,2,3,4,5,6,7,8),
labels = c("10k","15k", "20k", "25k", "35k", "50k", "75k","> 75k"))) %>%
dplyr::mutate(depression = factor(depression, levels = c(1,2),
labels = c("yes", "no"))) %>%
dplyr::mutate(copd = factor(copd, levels = c(1,2),
labels = c("yes", "no"))) %>%
dplyr::mutate(health = factor(health, levels = c(1,2),
labels = c("good", "poor"))) Let’s look at the summary of the clean 2017 data frame. I am interested in sex, age, income, education, and race as demographic factors. I am interested in previous depression, COPD, and general health as outcomes. From the summary, there is a bias in sex and age. This might be explained by the fact that BRFSS is a phone based survey. It seems that income and education is skewed twoards highly educated, high earning individuals as well. The BRFSS might be biased towards home-owners and landline users over cell-phone users.
## Metro Area Metro_ID race sex age education income depression copd health
## Length:176937 Length:176937 White :138185 Male :79948 18-24: 9372 None : 210 10k : 7348 yes: 35853 yes: 13249 good:146650
## Class :character Class :character Black : 18934 Female:96989 25-29: 9552 Elementary : 3202 15k : 7613 no :141084 no :163688 poor: 30287
## Mode :character Mode :character American Indian: 1806 30-34:10675 Some HS : 6744 20k :11613
## Asian : 385 35-39:11531 HS :41713 25k :14631
## Hispanic : 17627 40-44:11104 Some College:48317 35k :17058
## 45-49:13113 College :76751 50k :23933
## 50-54:15770 75k :27922
## 55-59:18469 > 75k:66819
## 60-64:19574
## 65-69:19402
## 70-74:15710
## 75-29:10526
## >79 :12139
cbsa2fips <- read.csv("cbsa2fipsxw.csv")
cbsa2fips %<>%
dplyr::mutate(fipsstatecode = as.character(cbsa2fips$fipsstatecode)) %>%
dplyr::mutate(fipscountycode = as.character(cbsa2fips$fipscountycode)) %>%
dplyr::mutate(fips_state = ifelse(nchar(fipsstatecode) == 1, paste0("0",fipsstatecode),
fipsstatecode)) %>%
dplyr::mutate(fips_county = ifelse(nchar(fipscountycode) == 2, paste0("0",fipscountycode),
ifelse(nchar(fipscountycode) == 1, paste0("00",fipscountycode), fipscountycode))) %>%
dplyr::mutate(ID = paste0("0", as.character(050000000000 + as.numeric(paste0(fips_state,fips_county))))) %>%
dplyr::rename('Metro_ID' = cbsacode) %>%
dplyr::select(ID, 'Metro_ID')
cbsa2fips$`Metro_ID` = as.character(cbsa2fips$`Metro_ID`)
# associate all counties to given metro area
acs.fmr.food.2017 %<>%
left_join(cbsa2fips, by="ID")# function to aggregate all acs.fmr.food cost of living data by metro area
# returns mean for all counties in a given metro area
metro_costs <- function(df){
df %<>%
dplyr::select(fmr0, fmr1, fmr2, fmr3, fmr4, Metro_ID, poverty_percent,
housing_percent, meal_cost, meal_percent, med_house_income)
df = as.data.frame(df)
df %<>%
dplyr::select(-geometry) %>%
group_by(Metro_ID) %>%
summarise_at(vars('fmr0', 'fmr1', 'fmr2', 'fmr3', 'fmr4', 'poverty_percent',
'housing_percent', 'meal_cost', 'meal_percent',
'med_house_income'), mean)
return(df)
}
col_2017 <- metro_costs(acs.fmr.food.2017)
# join 2017 metro cost of living to BRFSS2017 dataframe
brfss2017 %<>%
left_join(col_2017, by = "Metro_ID")
# remove NAs
brfss2017 %<>%
na.omit(brfss2017)metro_income_ghealth <- ggplot(data=brfss2017, aes(x=health, y = med_house_income))+
geom_boxplot() +
xlab("General Health") +
ylab("Median Household Income ($)") +
ggtitle("BRFSS2017 General Health") +
theme(plot.title = element_text(hjust = 0.5))
metro_income_copd <- ggplot(data=brfss2017, aes(x=copd, y = med_house_income))+
geom_boxplot() +
xlab("COPD History") +
ylab("Median Household Income ($)") +
ggtitle("BRFSS2017 COPD History") +
theme(plot.title = element_text(hjust = 0.5))
metro_income_depression <- ggplot(data=brfss2017, aes(x=depression, y = med_house_income))+
geom_boxplot() +
xlab("Depression History") +
ylab("Median Household Income ($)") +
ggtitle("BRFSS2017 Depression") +
theme(plot.title = element_text(hjust = 0.5))
metro_income_health_outcomes <- cowplot::plot_grid(metro_income_ghealth,metro_income_copd,
metro_income_depression)
metro_income_health_outcomesmetro_fmr1_ghealth <- ggplot(data=brfss2017, aes(x=health, y = fmr1))+
geom_boxplot() +
xlab("General Health") +
ylab("Median 1-Bed Rent ($)") +
theme(plot.title = element_text(hjust = 0.5))
metro_fmr1_copd <- ggplot(data=brfss2017, aes(x=copd, y = fmr1))+
geom_boxplot() +
xlab("COPD History") +
ylab("Median 1-Bed Rent ($)") +
theme(plot.title = element_text(hjust = 0.5))
metro_fmr1_depression <- ggplot(data=brfss2017, aes(x=depression, y = fmr1))+
geom_boxplot() +
xlab("Depression History") +
ylab("Median 1-Bed Rent ($)") +
theme(plot.title = element_text(hjust = 0.5))
metro_rent_health_outcomes <- cowplot::plot_grid(metro_fmr1_ghealth,metro_fmr1_copd,
metro_fmr1_depression)
metro_rent_health_outcomesmetro_food_ghealth <- ggplot(data=brfss2017, aes(x=health, y = meal_cost))+
geom_boxplot() +
xlab("General Health") +
ylab("Cost of Meal ($)") +
theme(plot.title = element_text(hjust = 0.5))
metro_food_copd <- ggplot(data=brfss2017, aes(x=copd, y = meal_cost))+
geom_boxplot() +
xlab("COPD History") +
ylab("Cost of Meal ($)") +
theme(plot.title = element_text(hjust = 0.5))
metro_food_depression <- ggplot(data=brfss2017, aes(x=depression, y = meal_cost))+
geom_boxplot() +
xlab("Depression History") +
ylab("Cost of Meal ($)") +
theme(plot.title = element_text(hjust = 0.5))
metro_food_health_outcomes <- cowplot::plot_grid(metro_food_ghealth,metro_food_copd,
metro_food_depression)
metro_food_health_outcomesmetro_pov_ghealth <- ggplot(data=brfss2017, aes(x=health, y = poverty_percent))+
geom_boxplot() +
xlab("General Health") +
ylab("Poverty Rate") +
theme(plot.title = element_text(hjust = 0.5))
metro_pov_copd <- ggplot(data=brfss2017, aes(x=copd, y = poverty_percent))+
geom_boxplot() +
xlab("COPD History") +
ylab("Poverty Rate") +
theme(plot.title = element_text(hjust = 0.5))
metro_pov_depression <- ggplot(data=brfss2017, aes(x=depression,
y= poverty_percent))+
geom_boxplot() +
xlab("Depression History") +
ylab("Poverty Rate") +
theme(plot.title = element_text(hjust = 0.5))
metro_pov_health_outcomes <- cowplot::plot_grid(metro_pov_ghealth,metro_pov_copd,
metro_pov_depression)
metro_pov_health_outcomes# income and general health
ggplot(data=brfss2017, aes(x=income, fill=health)) +
geom_bar(position="dodge") +
ggtitle('Individual Income and General Health')+
xlab('Income')+
ylab('Count') +
theme(plot.title = element_text(hjust = 0.5)) # income and depression
ggplot(data=brfss2017, aes(x=income, fill=depression)) +
geom_bar(position="dodge") +
ggtitle('Individual Income and Depression History')+
xlab('Income')+
ylab('Count') +
theme(plot.title = element_text(hjust = 0.5)) # income and COPD
ggplot(data=brfss2017, aes(x=income, fill=copd)) +
geom_bar(position="dodge") +
ggtitle('Individual Income and COPD History')+
xlab('Income')+
ylab('Count') +
theme(plot.title = element_text(hjust = 0.5)) health outcomes.
# chi-square test for individual income and general health
brfss2017_health_table <- table(brfss2017$health, brfss2017$income)
brfss2017_health_table##
## 10k 15k 20k 25k 35k 50k 75k > 75k
## good 3024 3345 5988 8708 10974 16542 20359 47235
## poor 2467 2603 3159 3343 3118 3340 2690 3325
##
## Pearson's Chi-squared test
##
## data: brfss2017_health_table
## X-squared = 13558, df = 7, p-value < 0.00000000000000022
# chi-square test for individual income and depression history
brfss2017_depression_table <- table(brfss2017$depression, brfss2017$income)
brfss2017_depression_table##
## 10k 15k 20k 25k 35k 50k 75k > 75k
## yes 2083 2196 2635 3141 3157 4095 4466 7479
## no 3408 3752 6512 8910 10935 15787 18583 43081
##
## Pearson's Chi-squared test
##
## data: brfss2017_depression_table
## X-squared = 3625.9, df = 7, p-value < 0.00000000000000022
# chi-square test for individual income and COPD history
brfss2017_COPD_table <- table(brfss2017$copd, brfss2017$income)
brfss2017_COPD_table##
## 10k 15k 20k 25k 35k 50k 75k > 75k
## yes 988 1234 1360 1492 1374 1630 1343 1553
## no 4503 4714 7787 10559 12718 18252 21706 49007
##
## Pearson's Chi-squared test
##
## data: brfss2017_COPD_table
## X-squared = 4926.9, df = 7, p-value < 0.00000000000000022
# Train a logistic regression model using BRFSS2017 data
depression_glm <- glm(formula=depression ~ race + sex + age + education + income,
data=brfss2017, family = binomial)
health_glm <- glm(formula=health ~ race + sex + age + education + income,
data=brfss2017, family = binomial(logit))
copd_glm <- glm(formula=copd ~ race + sex + age + education + income,
data=brfss2017, family = binomial(logit))library(PRROC)
# ROC for depression logistic regression
depression_pred <- predict(depression_glm, data=brfss2017, type="response")
depression_glm_roc <- roc.curve(depression_pred[brfss2017$depression == "no"],
depression_pred[brfss2017$depression == "yes"], curve=TRUE)
# ROC for general health logistic regression
health_pred <- predict(health_glm, data=brfss2017, type="response")
health_glm_roc <- roc.curve(health_pred[brfss2017$health == "poor"],
health_pred[brfss2017$health == "good"], curve=TRUE)
# ROC for COPD logistic regression
copd_pred <- predict(copd_glm, data=brfss2017, type="response")
copd_glm_roc <- roc.curve(copd_pred[brfss2017$copd == "no"],
copd_pred[brfss2017$copd == "yes"], curve=TRUE)
plot(depression_glm_roc)# Train a logistic regression model using BRFSS2017 data
depression_glm2 <- glm(formula=depression ~ race + sex + age + education + income +
med_house_income + fmr1 + meal_cost,
data=brfss2017, family = binomial(logit))
health_glm2 <- glm(formula=health ~ race + sex + age + education + income + med_house_income
+ fmr1 + meal_cost,
data=brfss2017, family = binomial(logit))
copd_glm2 <- glm(formula=copd ~ race + sex + age + education + income + med_house_income
+ fmr1 + meal_cost,
data=brfss2017, family = binomial(logit))# ROC for depression logistic regression
depression_pred2 <- predict(depression_glm2, data=brfss2017, type="response")
depression_glm_roc2 <- roc.curve(depression_pred2[brfss2017$depression == "no"],
depression_pred2[brfss2017$depression == "yes"], curve=TRUE)
head(depression_pred2)## 1 2 3 4 5 6
## 0.7964293 0.8271381 0.6311824 0.5844652 0.8867128 0.8109481
# ROC for general health logistic regression
health_pred2 <- predict(health_glm2, data=brfss2017, type="response")
health_glm_roc2 <- roc.curve(health_pred2[brfss2017$health == "poor"],
health_pred2[brfss2017$health == "good"], curve=TRUE)
# ROC for COPD logistic regression
copd_pred2 <- predict(copd_glm2, data=brfss2017, type="response")
copd_glm_roc2 <- roc.curve(copd_pred2[brfss2017$copd == "no"],
copd_pred2[brfss2017$copd == "yes"], curve=TRUE)
plot(depression_glm_roc2)Overall, this might because the the aggregate data is not granular enough. In order to do this analysis, multiple counties had to be reassigned to one metro label. However, cost-of-living expenses can vary greatly depending not only on which city one is located in but also which county and which part of the county one lives in. Thus, averaging housing and food expenses from multiple geographical areas does not provide fine enough detail to how one’s indiviual income might be affect by cost of living. In order to better improve these models, surveys should ask individuals for their housing costs or provide more granular geospatial labelling for individuals.